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Abstract. This paper presents an improved analysis of a structured dimension- reduction map 
called the subsampled randomized Hadamard transform. This argument demonstrates that the 
map preserves the Euclidean geometry of an entire subspace of vectors. The new proof is much 
simpler than previous approaches, and it offers — for the first time — optimal constants in the estimate 
on the number of dimensions required for the embedding. 



1. Introduction 

Dimension reduction is an elegant idea from computer science that has found applications in 
numerical linear algebra. Here is the basic concept: it is often more efficient to solve a compu- 
tational problem presented in a high-dimensional space if we first transport the problem instance 
to a lower-dimensional space while preserving its essential structure. Researchers have found that 
randomness provides an extraordinarily effective way to construct these dimension-reduction maps. 
This approach is usually traced to the celebrated paper of Johnson and Lindenstrauss | JL84] . 

In randomized algorithms for matrix approximation, the goal of dimension reduction is to find 
a low-dimensional subspace that captures most of the action of an input matrix. One way to 
accomplish this task is to multiply the input matrix A by a relatively small dimension-reduction 
matrix fi to obtain Y = Aft. We then perform a QR factorization Y = QR to identify the 
range of the reduced matrix Y. For an appropriately designed dimension-reduction map, it can 
be shown that A ~ QQ*A with high probability. In words, we can approximate the input matrix 
by compressing it to the range of the reduced matrix. It is then possible to compute standard 
matrix decompositions of A by manipulating the low-rank approximation QQ*A. See the recent 
survey [HMTll] for a comprehensive treatment of these ideas and an extensive bibliography. 

In linear algebra applications, the cost of multiplication by an unstructured random matrix f2, 
such as a Gaussian matrix, can sometimes be prohibitive. In that case, we may prefer to draw 
the random matrix fi from a highly structured distribution that allows us to multiply ft into the 
input matrix substantially faster. Sarlos is credited with bringing structured dimension reduction 
to numerical linear algebra |Sar06| : see also [WLRT08] . 

The subsampled randomized Hadamard transform (SRHT) is a type of structured dimension- 
reduction map that is based on the Walsh-Hadamard matrix. We will prove that the SRHT 
preserves the geometry of an entire subspace of vectors, which is the essential ingredient required 
to show that the SRHT can be used in algorithms for randomized linear algebra. See the discussion 
in [HMTlll Sec. 11] for details. 
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The literature already contains a number of papers, including | AC 091 ILib09l INDT091 IHMTllj 
IAL1H iKWllj . that study the behavior of the SRHT and related dimension reduction maps. The 
current treatment differs in several regards. Here, the main technical difficulties are addressed using 
a version of the matrix Chernoff inequality [AWQ2J I Troll) . As a consequence of the simple proof 
schema, we are able — for the first time — to obtain optimal constants in our bounds. This improve- 
ment can be valuable in numerical applications that require concrete performance guarantees. 

This document is adapted from Appendix B of the technical report [HMT09], but much of the 
proof is new. This paper should be viewed as a codicil to the published version [HMTllj of the 
technical report, which uses the results presented here. 

1.1. Construction of the SRHT Matrix. An SRHT is a wide £ x n matrix of the form 
where 

• D is a random n x n diagonal matrix whose entries are independent random signs, i.e., ran- 
dom variables uniformly distributed on {±1}; 

• H is an n x n Walsh-Hadamard matrix, scaled by n -1 / 2 so it is an orthogonal matrix; 

• R is a random £ x n matrix that restricts an n-dimensional vector to £ coordinates, chosen 
uniformly at random. 

Our analysis relies on two basic properties of the Walsh-Hadamard matrix: The derived matrix 
H is orthogonal, and its entries all have magnitude in" 1 / 2 . Walsh-Hadamard matrices exist for 
each n = 2 P where p = 1,2,3,.... 

Remark 1.1. The Walsh-Hadamard matrix is the real analog of the discrete Fourier matrix. 
Whereas the n x n Fourier matrix displays the characters of the cyclic group of order n, the 
n x n Walsh-Hadamard matrix contains the characters of the additive group lT 2 where n = 2 P . In 
each case, the underlying algebraic structure allows us to multiply the matrix by a vector in about 
nlogn arithmetic operations. Note, however, that our results are purely analytic; they do not rely 
on the algebraic properties of the Walsh-Hadamard matrix. We have chosen not to work with the 
Fourier matrix to avoid some small complications associated with complex random variables. 

1.2. Intuition. The design of the SRHT may seem mysterious, but there are clear intuitions to 
support it |AC09| . Suppose that we want to estimate the energy (i.e., squared £2 norm) of a fixed 
vector x by sampling £ of its n entries at random. On average, these random entries carry £/n of 
the total energy. (The factor \Jnji reverses this scaling.) When x has a few large components, 
the variance of this estimator is very high. On the other hand, when the components of x have 
comparable magnitude, the estimator has much lower variance, so it is precise with very high 
probability. 

The purpose of the matrix product HD is to flatten out input vectors before we sample. To see 
why it achieves this goal, fix a unit vector x, and examine the first component of HDx. 

En 

where hij are the components of the matrix H. This random sum clearly has zero mean. Since the 
entries of H have magnitude ra -1 / 2 , the variance of the sum is Hoeffding's inequality [Hoe63] 
shows that 

P{|(JETDaj)i| >t}< 2e- n ' 2 / 2 . 

In words, the magnitude of the first component of HDx is typically about n~ 1 / 2 . The same 
argument applies to the remaining entries. Therefore, it is unlikely that any one of the n components 
of HDx is larger than \J n _1 log(ra). 
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Remark 1.2. This discussion suggests that the precise form of the SRHT is not particularly im- 
portant. Indeed, we can replace H by any unitary matrix whose entries are uniformly small and 
which is equipped with a fast matrix-vector multiply. We can also draw the diagonal entries of D 
from other subgaussian distributions. These changes complicate the analysis somewhat. 

1.3. Main Results. In early work, dimension reduction was accomplished with uniformly random 
projectors or, later, with Gaussian matrices. These random maps preserve, up to a constant factor, 
all the pairwise distances among p points in W n , provided that the embedding dimension is about 
log(p). We can also use a Gaussian matrix to transport an (unknown) fc-dimensional subspace from 
a high-dimensional space to a lower-dimensional space. In this case, it suffices to map the ambient 
space down to O(k) dimensions because we can discretize a fc-dimensional sphere using p = 0(e fc ) 
points. 

It turns out that an appropriately designed SRHT also preserves the geometry of an entire 
subspace of vectors. For this type of structured map, the embedding requires klog(k) dimensions 
because of certain phenomena connected with random sampling (Section I3.3|) . It also becomes 
necessary to use more sophisticated proof techniques. We have the following result. 

Theorem 1.3 (The SRHT preserves geometry). Fix annxk matrix V with orthonormal columns, 
and draw a random t x n SRHT matrix $ where the embedding dimension t satisfies 

1 2 

y/k + A/81og(fcn) log(fc) < £ < n. 

Then, except with probability 0(k~ 1 ), 

0.40 < o- k ($V) and <7i(*V) < 1.48. 
The symbol o~j(-) denotes the jth largest singular value of a matrix. 

The proof of Theorem 11.31 appears below. This result replaces Theorem B.4 of [HMT09 ] . Some 
precedents include |RV071 Thm. 3.1] and |Tro08t Sec. 9]; see also |NDT09| . Earlier results have the 
same structure as Theorem 11.31 but the constants are either exorbitant or absent. 

Remark 1.4. In Theorem[L3j the factor log(fc) in the lower bound on i cannot generally be removed. 
See Section 13.31 for an explanation and an example that demonstrates the necessity. 

Remark 1.5. For large problems, we have been able to obtain optimal numerical constants. Suppose 
that i is a sufficiently small positive number. If k 3> log(n), then sampling 

£> (l + i).jfelog(fc) 

coordinates is sufficient to ensure that $>V has constant condition number. See Theorem 13.21 for a 
more precise statement. The discussion in Section [3.31 indicates that there are cases where it does 
not suffice to draw (1 — l) ■ klog(k) samples. 

2. Technical Background 

In preparation for the main argument, we present our notation and some probability inequalities. 
These inequalities encapsulate all the difficulty in the proof. 

2.1. Notation. We write ej for the jth standard basis vector in M. n . The matrix I is a square 
identity matrix; we sometimes indicate the dimension with a subscript. The symbol * denotes 
transposition. 

Throughout this work, ||-|| refers to the ti vector norm or the associated operator norm. We also 
write || • ||p for the Frobenius norm. 

Given a subset T of indices in {1, 2, . . . , n}, we define the restriction operator Rt : M n — > M T via 
the rule 

(R T x){j)= Xj , j€T. 
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A Rademacher random variable takes the values ±1 with equal probability. We reserve the 
letter e for a Rademacher variable, and we often write e for a vector whose entries are independent 
Rademacher variables. 

2.2. Probability Inequalities. We delegate the hard work to some probability inequalities that 
describe the large-deviation behavior of specific types of random variables. First, we describe a tail 
bound for a convex function of Rademacher variates. This result was established by Ledoux [Led96[ 
Eqn. (1.9)]; see also |Led01[ §5.2] for a discussion of concentration in product spaces. 

Proposition 2.1 (Rademacher tail bound). Suppose f is a convex function on vectors that satisfies 
the Lipschitz bound 

\f(x) - f(y)\ < L \\x - y\\ forallx,y. 
Let e be a Rademacher vector. For all t > 0, 

¥{f(e)>Ef(e) + Lt}<e- t2 / 8 . 

Next, we present a matrix analog of the well-known Chernoff inequality. The proof is based on 
the matrix Laplace transform method proposed in an influential paper of Ahlswede-Winter |AW02j . 
We derive the result using more recent ideas [Troll] . which deliver an essential improvement on 
the earlier work. The other key tool is a method |GN10| . ultimately due to Hoeffding |Hoe63| . 
for transferring results from the model where we sample with replacement to the model where we 
sample without replacement. 

Theorem 2.2 (Matrix Chernoff). Let 2£ be a finite set of positive-semidefinite matrices with 
dimension k, and suppose that 

max Xe ,r A max (X) < B. 
Sample {X\, . . . ,Xg} uniformly at random from without replacement. Compute 

Mmin ■=&• Amin (E X{) and /Umax := I ■ A max (EXi). 

Then 



{A max (£2 Xj^j > (1 + 5)/w} < k ■ 



e~ 5 



„<5 



(1 + 5) 



1+5 



for 5 G [0, 1), and 

A*max/ B 

for 6>0. 



Proof sketch. We establish only the upper bound; the lower bound is established by applying a 
similar method to —^ZjXj. By homogeneity, take B = 1. We use the matrix Laplace transform 
method [TrolU Prop. 3.1] to bound the probability of a large deviation: 



X 3 ) > (1 + OMmax} < inf je- fl C 1+ *>"«« • Etr exp (£. ^j) } • (2-1) 

The Laplace transform bound (|2.1|) is essentially due to Ahlswede and Winter [AW 02], 

Let {Y±, ■ ■ ■ , Y}} be a random sample from X drawn with replacement. Gross and Nesme [GN1Q] 
have shown that the trace of the matrix moment generating function (mgf) of a sample without 
replacement is dominated by the trace of the matrix mgf of a sample with replacement: 

E tr exp ( V . 9Xj) < E tr exp ( V . 6Yj) . (2.2) 



Lemma 5.8 of [Troll] gives a semidefinite bound for the mgf of Yj. (See also [AW021 Thm. 19].) 

Ee^ ^ e^- 1 )^^) = e^" 1 )^^) for 9 G R. (2.3) 
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On the right-hand side of this bound, we have exploited the fact that Yj ~ Y\ ~ X\. Lemma 3.4 
of [Troll] allows us to use the mgf bound (|2.3p to bound the mgf of the entire sum: 

E trexp 0Y j) < trex P (V - 1) ■ £ ■ (E JTi)) < fc • exp (V - 1) • ^ max ) . (2.4) 



Substitute (I2.4p into (|2.2p and introduce the resulting inequality into the probability bound (J2JJ). 
The infimum is achieved at 9 = log(l + 5). □ 

3. The SRHT Preserves Geometry 

In this section, we establish a slightly more specific version of our main result, Theorem 11.31 

Theorem 3.1 (The SRHT preserves geometry). Let V be an n x k matrix with orthonormal 
columns. Select a parameter £ that satisfies 

2 
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\fk + ^J%\o^hn) log(fc) <£<n. 



Draw an I x n SRHT matrix 3>. Then, except with probability 3k , 

i /To" 
-= < o-d^V) and aAQV) < \ — . 

The constants in the sample size £ specified in Theorem 13.11 are somewhat larger than we might 
like because we wanted to ensure that the statement is effective for reasonable values of k and n. 
We also have the following result of a more asymptotic flavor. 

Theorem 3.2 (SRHT: Large sample bounds). Fix a positive number l < c. Let V be an n x k 
matrix with orthonormal columns, where k > Ct~ 2 log(n). Select a parameter £ that satisfies 

(1 + t) ■ fclog(fc) < £ < n. 

Draw an £ x n SRHT 3>. Then 

i < o- k ($V) and a x ($V) < ^/e 
except with probability 0(k~ CL ). The numbers c and C are positive universal constant. 

3.1. Overview of Theorem 13.11 We present the main line of reasoning here, postponing the 
proofs of the lemmata. The first step is to show that the matrix HD equilibrates row norms. 

Lemma 3.3 (Row norms). Let V be an n x k matrix with orthonormal columns. Then HDV is 
an n x k matrix with orthonormal columns, and 



k Slogan 
max ||e J -(i3\DVj|| > y — + 




j=i,...,n " J 11 V n V n 

When k 3> logn, the second term in the bound is negligible, in which case the row norms are 
essentially as small as possible. On the other hand, when k < logn, small sample effects can make 
some row norms large. 

The next result states that randomly sampling rows from a matrix with orthonormal columns 
results in a well-conditioned matrix. The minimum size of the sample depends primarily on the 
row norms, and the sampling procedure is most efficient when the matrix has uniformly small rows. 
We state this result in detail because it may have independent interest. 



Lemma 3.4 (Row sampling). Let W be an n x k matrix with orthonormal columns, and define 

£ > aMlog(k). 



1 1 2 

the quantity M := n • maxj = i }iii)7l e^VF . For a positive parameter a, select the sample size 
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Draw a random subset T from {1,2,..., n} by sampling I coordinates without replacement. Then 



(1 5)1 < o- k (R T W) and aARrW) < J (1 + 7? ^ , 
n V n 

with failure probability at most 



k ■ 



e~ 5 



(1-5) 



1-5 



a log k 

+ k 



(l + r? )(i+r?) 



a log k 



These two lemmata allow us to establish Theorem 13.11 quickly. Recall that V be an n x k matrix 
with orthonormal columns, and draw an I x n SRHT $ = R^HD. 

Define the matrix W = HDV. Lemma 13.31 with f3 = k establishes that W is a matrix with 
orthonormal columns whose largest row norm is essentially as small as possible, so that 

2 



M:=n- max ||e*W|| 2 < Vk + y / 81og(fcn) 



except with probability k . Next, we apply Lemma 13.41 with a = 4, with 5 = 5/6, and with 
r\ = 7/6. A numerical reckoning shows that the additional probability of failure is at most 2k~ l . 
Altogether, we obtain the following result. For 

i > 4Mlog(fc), 

it holds that 



/13^ 

— < a k (R T W) and a^RpW) < \J — 

except with probability 3fc _1 . Since = Wn/I- RtW, we have established Theorem 13. 11 

The same type of argument implies Theorem 13.21 In this case, we fix a sufficiently small positive 
number l. Then we take (3 = k and a = 1 + t/2 and S = 1 — i? and rj = e — 1. We omit the details. 

Remark 3.5. We can establish a slightly stronger sample bound in Theorem 13.21 by choosing the 
parameter i = A/\og(k) for a sufficiently large number A. With this selection, however, we do not 
obtain a constant bound for the lower singular value. 

3.2. Proofs of Supporting Lemmas. It remains to check that the underlying results are true. 
We begin with the claim that HD balances row norms. 

Proof of Lemma \3.3l The orthonormality condition on V is equivalent with the identity V* V = 1^. 
Therefore, ||V^|| = 1 and ||V^|| F = \fk. To check that HDV is also an orthonormal matrix, simply 
compute that 

(HDV)*{HDV) = V*V = I 

because D,H are orthogonal matrices. 

Fix a row index j E {1, 2, . . . , n}, and define the function 

f{x) := ||e*ifdiag(a;)V|| =: ||a:*SV|| . 

We have written E := diag(e^iJ) for the diagonal matrix constructed from the j'th row of H; 

observe that each entry of E has magnitude n" 1 / 2 . The function / is convex, and we quickly 
determine its Lipschitz constant: 

\f(x) - f(y)\ <\\(x- y)*EV\\ < \\x - y\\ \\E\\ \\V\\ = ^j=\\x- y\\ . 



We may use the function / to study the variation of the rows norms of HDV . Recall that 
D := diag(e) for a Rademacher vector e, and consider the random variable 

/(e) = ||e^Jf£>V|| . 
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First, we bound the expectation. 

E/(e) < [E/(e) 2 ] 1/2 = \\EV\\ F < \\E\\ \\V\\ F = ^ 
Apply the Rademacher tail bound, Proposition 12.11 with t = y/8 log(/3n) to reach 

vl\\eUHDV)\\ > < e s^nys = _L 

I 11 J 11 V n V n pn 

This estimate holds for each row index j = 1,2, ... ,n. Finally, take a union bound over these n 
events to reach the advertised conclusion. □ 

Now we establish the result on row sampling. 

Proof of Lemma \3.4\ Let w* denote the jth row of W, and define M := n • maxj ||tOj || 2 . 

We can control the extreme singular values of the random matrix R?W by bounding the extreme 
eigenvalues of the k x k Gram matrix 

Y := (R T Wy(R T W) = ^2 jeT W jW *. 

Recall that T is a random subset of {1, 2, . . . , n} consisting of i coordinates sampled without re- 
placement. Therefore, we may just as well view Y as a sum of i random matrices X±, . . . ,Xp 
sampled without replacement from the family X := {wjw* : j = 1,2, ... ,n} of positive semidefi- 
nite matrices with dimension k. 

The matrix Chernoff bound, Theorem 12. 2^ allows us to obtain large deviation bounds for the 
extreme eigenvalues of Y. We just need to determine the parameters involved in the statement. 
First, note that 

2 M 

\ m zx{wjW*) = \\wj\\ < — for j = 1,2, ... ,n. 

n 

We easily compute the expectation of the first sample using the fact that the columns of W are 
orthonormal: 



1 x-^n 1 1 

EXi = -> WjW* = —W*W = —I. 



1 i 

=1 J n n 



As a consequence, we obtain 

i I 

/^min — and /imax — • 

n n 

The lower Chernoff bound yields 



The upper Chernoff bound yields 



II M 

AminCn <(l-S)-\<k 



e~ 5 



(1-5) 



1-5 



for 5 G [0,1). 



AmaxOn >(l + S)-\<k- 

n 



e <5 



(1 + 5) 



l+<5 



£/M 

for 5 > 0. 



Substitute the identities 

A mi „(>0 = o- k (R T W) 2 and A max (Y) = a^RrW) 2 . 
Simplify the formulae to complete the proof. □ 
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3.3. Collecting Coupons. The logarithmic factor in these results is necessary, as we show by 
example. This discussion is extracted from jHMTlll Sec. 11]. 

Fix an integer k, and set n = k 2 . Form annxi orthonormal matrix W by regular decimation 
of the n x n identity matrix. More precisely, W is the matrix whose jth row has a unit entry in 
column 1 + ( j — l)/k when j = 1 (mod k) and is zero otherwise. To see why this type of matrix is 
inconvenient, it is helpful to consider an auxiliary matrix V = HDW . Observe that, up to scaling 
and modulation of rows, V consists of k copies of a k x k Walsh-Hadamard transform stacked 
vertically. 

Now, suppose that we apply an SRHT = RHD to the matrix W . We obtain a matrix of the 
form X = = RV, which consists of i random rows sampled from V. Theorem 13.11 cannot 

hold unless a^X) > 0. To ensure the latter event takes place, we must select at least one copy each 
of the k distinct rows of W. This is the coupon collector's problem [MR.95I Sec. 3.6]. To obt am a 
complete set of k rows with nonnegligible probability, we must sample at least klog(k) rows. The 
fact that we are sampling without replacement does not improve the analysis appreciably because 
the matrix has too many rows. 
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